fix join column handling logic for `On` and `Using` constraints #605

houqp · 2021-06-23T05:54:25Z

Which issue does this PR close?

Closes #601.
Closes #671.

Also fixed a bug where index_of_column won't error out on duplicated fields.

Rationale for this change

In MySQL and Postgres, Join with On constraints produces output with join column from both relations preserved. For example:

SELECT * FROM test t1 JOIN test t2 ON t1.id = t2.id;

produces:

id	id
1	1
2	2

While join with Using constraints deduplicates the join column:

SELECT * FROM test t1 JOIN test t2 USING (id);

produces:

id
1
2

However, in our current implementation, join column dedup is applied in all cases. This PR changes the behavior so it's consistent with MySQL and Postgres.

Here comes the annoying part.

Note that for join with Using constraint, users can still project join columns using relations from both sides. For example SELECT t1.id, t2.id FROM test t1 JOIN test t2 USING (id) produces the same output as SELECT * FROM test t1 JOIN test t2 ON t1.id = t2.id.

This means for Using joins, we need to model a join column as a single shared column between both relations. Current DFField struct only allows a field/column to have a single qualifier, so I ended adding a new shared_qualifiers field to DFField struct to handle this edge-case. Our logical plan builder will be responsible for setting this field when building join queries with using constraints. During query optimization and planning, the shared_qualifiers field is used to look up column by name and qualifier.

Other alternatives include changing the qualifer field of DFField to an option of enum to account for single qualifier and shared qualifiers. None of these approaches felt elegant to me. I am curious if anyone has ideas or suggestions on how to better implement this behavior.

What changes are included in this PR?

Expose JoinConstraints to ballista.proto
Unify JoinType enum between logical and physical planes
Added context execution tests to enforce semantics for On and Using joins
Refactored dfschema module to use index_of_column_by_name in both index_of_column and field_with_qualified_name methods.

Are there any user-facing changes?

Join with On constraints will now output joined columns from both relations without deduplication. Dedup is only applied to join with Using constraints.

houqp · 2021-06-23T06:07:27Z

datafusion/src/physical_plan/hash_utils.rs

 use crate::physical_plan::expressions::Column;

-/// All valid types of joins.
-#[derive(Clone, Copy, Debug, Eq, PartialEq)]
-pub enum JoinType {


reuse the same enum from logical plane.

alamb

Thank you @houqp -- I think this PR is an improvement.

Also, I did some more testing and there are still cases that don't appear to be working (which I think would be fine to do as a follow on PR) -- they no longer panic :)

echo "1" > /tmp/foo.csv
cargo run -p datafusion-cli

> 
CREATE EXTERNAL TABLE foo(bar int)
STORED AS CSV
LOCATION '/tmp/foo.csv';
0 rows in set. Query took 0.000 seconds.
> select * from foo as f1 JOIN foo as f2 ON f1.bar = f2.bar;
Plan("Schema contains duplicate unqualified field name 'bar'")

> select f1.bar, f2.bar from foo as f1 JOIN foo as f2 ON f1.bar = f2.bar;
Plan("Schema contains duplicate unqualified field name 'bar'")

> select f1.bar, f2.bar from foo as f1 JOIN foo as f2 USING(bar);
NotImplemented("Unsupported compound identifier '[\"f1\", \"bar\"]'")

> select * from foo as f1 JOIN foo as f2 USING(bar);
+-----+
| bar |
+-----+
| 1   |
+-----+

alamb · 2021-06-23T13:49:44Z

datafusion/src/execution/context.rs

@@ -1259,6 +1259,96 @@ mod tests {
        Ok(())
    }

+    #[tokio::test]
+    async fn left_join_using() -> Result<()> {


👍 definitely an improvement

datafusion/src/logical_plan/dfschema.rs

alamb · 2021-06-23T14:01:38Z

@houqp

The approach we seem to be taking is to try and keep the information about where a column came from in a join -- e.g. that a output DF field could be referred to as either f1.foo or f2.foo for example, which is getting complicated

It seems like a core challenge is in the semantics of the * expansion in select * from ... type queries which varies depending on type of join. Have you considered perhaps focusing in that area rather than trying to track the optional qualifiers through the plans?

So for example, in a f1 JOIN f1 have the output schema contain both f1.foo and f2.foo but then change the expansion of * to have somethign like f1.foo as foo?

houqp · 2021-06-23T18:56:38Z

Good catch on the regressions, I will get them fixed tonight with unit tests.

It seems like a core challenge is in the semantics of the * expansion in select * from ... type queries which varies depending on type of join.

Yeah, that's exactly the problem.

Initially, I tried the approach of keeping the join columns from both relations (i.e. f1.foo and f2.foo) in the logical schema. Then I ran into the problem of it breaking the physical planning invariants. Because for Using join, we are only producing a single join column in the output, the physical plan schema can only has a single field for the join column. The schema from the logical plan will need to match that as well in order to honor that invariant.

alamb · 2021-06-23T21:22:00Z

Because for Using join, we are only producing a single join column in the output

is this something we can change (aka update the using join to produce columns and then rely on projection pushdown to prune the uneeded ones out?)

houqp · 2021-06-23T22:22:36Z

is this something we can change (aka update the using join to produce columns and then rely on projection pushdown to prune the uneeded ones out?)

That's a good point, the single column output semantic doesn't need to be enforced at the join node level. let me give this a try as well 👍 This could simplify our join handling logic. I will also double check to see if this would result in nontrivial runtime overhead for us.

houqp · 2021-06-24T05:17:35Z

@alamb, I wasn't able to reproduce the errors you showed in #605 (review)

Here is what I got:

> CREATE EXTERNAL TABLE foo(bar int)
STORED AS CSV
LOCATION '/tmp/foo.csv';
0 rows in set. Query took 0.001 seconds.
> select f1.bar, f2.bar from foo as f1 JOIN foo as f2 ON f1.bar = f2.bar;
+-----+-----+
| bar | bar |
+-----+-----+
| 3   | 3   |
| 1   | 1   |
| 4   | 4   |
| 2   | 2   |
+-----+-----+
4 rows in set. Query took 0.022 seconds.
>  select * from foo as f1 JOIN foo as f2 ON f1.bar = f2.bar;
+-----+-----+
| bar | bar |
+-----+-----+
| 3   | 3   |
| 1   | 1   |
| 4   | 4   |
| 2   | 2   |
+-----+-----+
4 rows in set. Query took 0.022 seconds.
> select f1.bar, f2.bar from foo as f1 JOIN foo as f2 USING(bar);
+-----+-----+
| bar | bar |
+-----+-----+
| 2   | 2   |
| 3   | 3   |
| 1   | 1   |
| 4   | 4   |
+-----+-----+
4 rows in set. Query took 0.020 seconds.
>

Perhaps you rain those tests with a different build?

alamb · 2021-06-24T14:46:12Z

Perhaps you rain those tests with a different build?

I am not sure what I tested with, but I re-ran the tests at b8da356 and everything is working -- sorry for the noise.

alamb

Given this PR is better than master (aka it doesn't panic) I would be fine merging it in as is and iterating on the design subsequently. Alternately, if you plan to make more changes as part of this PR @houqp I can wait to merge it in

houqp · 2021-06-24T16:23:38Z

let's wait for my alternative solutions and review them together before the merge unless there is urgent issue in master that this PR addresses. I would like to avoid merging in a premature abstraction to reduce noise :)

alamb · 2021-06-27T10:54:01Z

Marking as draft so it is clearer from the list of PRs that this one is not quite ready to go (and thus I don't merge it accidentally)

get rid of shared field and move column expansion logic into plan builder and optimizer.

houqp · 2021-07-04T04:38:00Z

ballista/rust/core/src/serde/logical_plan/from_proto.rs

-                    protobuf::JoinType::Full => JoinType::Full,
-                    protobuf::JoinType::Semi => JoinType::Semi,
-                    protobuf::JoinType::Anti => JoinType::Anti,
-                };


conversion handled by shared method in serde/mod.rs

houqp · 2021-07-04T04:43:28Z

datafusion/src/logical_plan/builder.rs

-                }
-            };
-
+        JoinType::Inner | JoinType::Left | JoinType::Full | JoinType::Right => {


Join schemas are now consistent across all join types and constraints. We implement the join column "merge" semantic externally in plan builder and optimizer.

houqp · 2021-07-04T04:45:23Z

datafusion/src/logical_plan/dfschema.rs

-            if field.qualifier() == col.relation.as_ref() && field.name() == &col.name {
-                return Ok(i);
-            }
+    fn index_of_column_by_name(


Refactored this code into its own helper method to reduce duplicated code between field_with_qualified_name and index_of_column. This should also make index_of_column more robust.

houqp · 2021-07-04T04:47:41Z

datafusion/src/logical_plan/expr.rs

@@ -1118,36 +1133,56 @@ pub fn columnize_expr(e: Expr, input_schema: &DFSchema) -> Expr {
    }
 }

+/// Recursively replace all Column expressions in a given expression tree with Column expressions
+/// provided by the hash map argument.
+pub fn replace_col(e: Expr, replace_map: &HashMap<&Column, &Column>) -> Result<Expr> {


this is used in predicate push down optimizer to push predicates to both sides of the join clause.

use pub(crate)?

houqp · 2021-07-04T04:48:32Z

datafusion/src/logical_plan/expr.rs

-pub fn normalize_col(e: Expr, schemas: &[&DFSchemaRef]) -> Result<Expr> {
-    struct ColumnNormalizer<'a, 'b> {
-        schemas: &'a [&'b DFSchemaRef],
+pub fn normalize_col(e: Expr, plan: &LogicalPlan) -> Result<Expr> {


Change from schema to logical plan so we can extract join columns for join clauses with using constraints.

Note I wrote some tests that will need to be adjusted in #689 but that is no big deal

houqp · 2021-07-04T05:08:50Z

datafusion/src/physical_plan/hash_utils.rs

-    on: JoinOnRef,
-    join_type: &JoinType,
-) -> Schema {
+pub fn build_join_schema(left: &Schema, right: &Schema, join_type: &JoinType) -> Schema {


same schema build simplification as the one we introduced in logical plan builder.

houqp · 2021-07-04T05:10:05Z

datafusion/src/sql/planner.rs

@@ -560,7 +560,8 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                //   SELECT c1 AS m FROM t HAVING c1 > 10;
                //   SELECT c1, MAX(c2) AS m FROM t GROUP BY c1 HAVING MAX(c2) > 10;
                //
-                resolve_aliases_to_exprs(&having_expr, &alias_map)
+                let having_expr = resolve_aliases_to_exprs(&having_expr, &alias_map)?;
+                normalize_col(having_expr, &projected_plan)


needed if having expression referenced using join columns with unqualified column name.

houqp · 2021-07-04T06:58:34Z

@alamb reimplemented the logic based on your suggestion in #605 (comment).

Turns out there are many more edge-cases that need to be handled for using join other than wildcard expansion:

predicate push down on join columns
normalize unqualified column expressions that reference join columns with qualifiers

I have implemented support for all these edge-cases, but decided to leave out the wildcard expansion change as a follow up PR to keep the diff easier to review.

UPDATE: filed #678 as follow up.

jimexist

LGTM

alamb · 2021-07-05T11:09:37Z

Thanks @houqp -- I'll try and review this later today but I may run out of time in which case I'll get it done tomorrow

jimexist · 2021-07-06T13:08:11Z

datafusion/src/logical_plan/dfschema.rs

+                // field to lookup is qualified but current field is unqualified.
+                (Some(_), None) => false,
+                // field to lookup is unqualified, no need to compare qualifier
+                _ => field.name() == name,


prefer to write out cases, (None, None) and (None, Some(_) and union them. this makes it clearer?

jimexist · 2021-07-06T13:10:40Z

datafusion/src/logical_plan/expr.rs

+        for schema in &schemas {
+            let fields = schema.fields_with_unqualified_name(&self.name);
+            match fields.len() {
+                0 => continue,


how would this be possible?

We are iterating through schemas from all plan nodes in the provided plan tree, each plan node could have different schemas, so when we do the fields_with_unqualified_name look up, some of these plan nodes will no contain a field that matches self.name. We just pick the first plan node that contains schema field matches the unqualified name.

jimexist · 2021-07-06T13:13:26Z

datafusion/src/logical_plan/plan.rs

+                        on.iter()
+                            .map(|entry| {
+                                std::iter::once(entry.0.clone())
+                                    .chain(std::iter::once(entry.1.clone()))


why not just flatmap with a vec![entry.0.clone(), entry.1.clone()] - it's cleaner

i wanted to avoid an extra memory allocation incurred by Vec::new, but i will double check to see if once chain is actually generating the optimal code without memory allocations.

@jimexist sent #704 to address this. I found a comprise that's more readable and avoids the extra memory allocation.

Here is my test to compare the code gen with different approaches: https://godbolt.org/z/j3dcjbnvM.

jimexist · 2021-07-06T13:13:55Z

datafusion/src/optimizer/filter_push_down.rs

+        .map(|f| {
+            std::iter::once(f.qualified_column())
+                // we need to push down filter using unqualified column as well
+                .chain(std::iter::once(f.unqualified_column()))


same as https://github.com/apache/arrow-datafusion/pull/605/files#r664543206

jimexist · 2021-07-06T13:14:00Z

datafusion/src/optimizer/filter_push_down.rs

        .collect::<HashSet<_>>();
    let right_columns = &right
        .fields()
        .iter()
-        .map(|f| f.qualified_column())
+        .map(|f| {
+            std::iter::once(f.qualified_column())


same as https://github.com/apache/arrow-datafusion/pull/605/files#r664543206

alamb

I think this looks good @houqp -- I went through the test changes and code carefully and I think this is good to go.

@Dandandan I am not sure if you want to take a look at this given it changes how the Join relations are encoded.

alamb · 2021-07-06T17:07:55Z

datafusion/src/logical_plan/dfschema.rs

+            })
+            .map(|(idx, _)| idx)
+            .collect();
+


it probably doesn't matter but you could avoid the Vec allocation by something like:

let matches = self....; match matches.next() { let name = matches.next() { None => // error about no field Some(name) { if matches.next().is_some() => // error about ambiguous reference else name }

Good suggestion, fixed in #703.

alamb · 2021-07-06T17:11:37Z

datafusion/src/logical_plan/expr.rs

-pub fn normalize_col(e: Expr, schemas: &[&DFSchemaRef]) -> Result<Expr> {
-    struct ColumnNormalizer<'a, 'b> {
-        schemas: &'a [&'b DFSchemaRef],
+pub fn normalize_col(e: Expr, plan: &LogicalPlan) -> Result<Expr> {


Note I wrote some tests that will need to be adjusted in #689 but that is no big deal

alamb · 2021-07-06T17:13:57Z

datafusion/src/logical_plan/plan.rs

@@ -354,6 +356,43 @@ impl LogicalPlan {
            | LogicalPlan::CreateExternalTable { .. } => vec![],
        }
    }
+
+    /// returns all `Using` join columns in a logical plan
+    pub fn using_columns(&self) -> Result<Vec<HashSet<Column>>, DataFusionError> {


FWIW this function feels like it might better belong in some sort of utils rather than a method on LogicalPlan -- perhaps https://github.com/houqp/arrow-datafusion/blob/qp_join/datafusion/src/optimizer/utils.rs#L50

@alamb this is used outside of the optimizers as well, for example, the logical plan builder. With that context, do you still think it should live in optimizer utils module?

alamb · 2021-07-06T17:15:11Z

datafusion/src/optimizer/filter_push_down.rs

@@ -232,6 +241,38 @@ fn split_members<'a>(predicate: &'a Expr, predicates: &mut Vec<&'a Expr>) {
    }
 }

+fn optimize_join(


alamb · 2021-07-06T17:17:45Z

datafusion/src/optimizer/filter_push_down.rs

@@ -901,20 +979,61 @@ mod tests {
            format!("{:?}", plan),
            "\
            Filter: #test.a LtEq Int64(1)\
-            \n  Join: #test.a = #test.a\
+            \n  Join: #test.a = #test2.a\


alamb · 2021-07-06T17:20:24Z

datafusion/src/physical_plan/hash_join.rs

-            "| 1  | 7  | 10 | 4  | 70 |",
-            "| 2  | 8  | 20 | 5  | 80 |",
-            "+----+----+----+----+----+",
+            "+----+----+----+----+----+----+",


these changes make sense to me

alamb · 2021-07-06T17:25:47Z

Not sure if this will conflict with #660

alamb · 2021-07-07T12:03:54Z

I am going to merge this PR in (as it conflicts with #689) -- I think we can make additional improvements as follow on PRs.

alamb · 2021-07-07T12:03:59Z

Thanks again @houqp

alamb · 2021-07-07T12:05:41Z

I also merge apache/master into this branch locally on my machine and re-ran the tests to verify there were no conflicts

alamb · 2021-07-07T12:16:31Z

Aaand it looks like I forgot to fetch apache/master on my machine prior to testing -- fixed in #694

houqp · 2021-07-10T19:49:24Z

Apologize for being too busy last week and haven't had the time to update my PR, I am going to send a new PR to address all the feedbacks shortly.

houqp force-pushed the qp_join branch from 64c28cb to ecf923f Compare June 23, 2021 06:04

houqp commented Jun 23, 2021

View reviewed changes

houqp force-pushed the qp_join branch from ecf923f to b8da356 Compare June 23, 2021 06:25

alamb reviewed Jun 23, 2021

View reviewed changes

alamb approved these changes Jun 24, 2021

View reviewed changes

alamb added the datafusion Changes in the datafusion crate label Jun 24, 2021

alamb marked this pull request as draft June 27, 2021 10:54

houqp added 2 commits June 27, 2021 22:48

fix join column handling logic for On and Using constraints

c97e550

handling join column expansion during USING JOIN planning

702a5a0

get rid of shared field and move column expansion logic into plan builder and optimizer.

houqp force-pushed the qp_join branch from b8da356 to 702a5a0 Compare July 4, 2021 03:02

github-actions bot added the ballista label Jul 4, 2021

Merge remote-tracking branch 'upstream/master' into qp_join

c462206

houqp marked this pull request as ready for review July 4, 2021 04:35

houqp requested a review from alamb July 4, 2021 04:35

houqp commented Jul 4, 2021

View reviewed changes

add more comments & fix clippy

ccbb844

houqp commented Jul 4, 2021

View reviewed changes

houqp added 2 commits July 3, 2021 22:14

add more comment

091bb75

reduce duplicate code in join predicate pushdown

e97b86a

houqp mentioned this pull request Jul 4, 2021

self-join followed by a projection with duplicated names results in panic #671

Closed

jimexist approved these changes Jul 5, 2021

View reviewed changes

houqp mentioned this pull request Jul 5, 2021

dedup using join column in wildcard expansion #678

Merged

jimexist reviewed Jul 6, 2021

View reviewed changes

alamb approved these changes Jul 6, 2021

View reviewed changes

alamb merged commit 18c581c into apache:master Jul 7, 2021

alamb mentioned this pull request Jul 7, 2021

Fix test output due to logical merge conflict #694

Merged

houqp deleted the qp_join branch July 10, 2021 20:04

This was referenced Jul 10, 2021

avoid iterator materialization in column index lookup #703

Merged

replace once iter chain with array::IntoIter #704

Merged

houqp added the api change Changes the API exposed to users of the crate label Jul 30, 2021

fix join column handling logic for On and Using constraints #605

fix join column handling logic for On and Using constraints #605

Conversation

houqp commented Jun 23, 2021 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Jun 23, 2021

houqp commented Jun 23, 2021

alamb commented Jun 23, 2021

houqp commented Jun 23, 2021

houqp commented Jun 24, 2021

alamb commented Jun 24, 2021

alamb left a comment

Choose a reason for hiding this comment

houqp commented Jun 24, 2021 • edited Loading

alamb commented Jun 27, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

houqp commented Jul 4, 2021 • edited Loading

jimexist left a comment

Choose a reason for hiding this comment

alamb commented Jul 5, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

houqp Jul 6, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Jul 6, 2021

alamb commented Jul 7, 2021

alamb commented Jul 7, 2021

alamb commented Jul 7, 2021

alamb commented Jul 7, 2021

houqp commented Jul 10, 2021

fix join column handling logic for `On` and `Using` constraints #605

fix join column handling logic for `On` and `Using` constraints #605

houqp commented Jun 23, 2021 •

edited

Loading

houqp commented Jun 24, 2021 •

edited

Loading

houqp commented Jul 4, 2021 •

edited

Loading

houqp Jul 6, 2021 •

edited

Loading